## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
In this Explore and Summarize Data project I explore the red wine data set. The main objective is to find which chemical variables have an impact on the wine. The data set contains 12 variables and 1599 observations.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
From the summary I see that the red wines have:
- Chlorides: between 0.012 and 0.611, with a mean of 0.087
- pH: between 2.740 and 4.010, with a mean of 3.311
- Alcohol: between 8.40 and 14.90, with a mean of 10.42
- Quality: between 3 and 8, with a mean of 5.636
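The same kind of range-and-mean overview can be pulled out of a data frame directly. The following is only a minimal sketch: it uses the first ten rows shown in the `str()` output of this report rather than the full 1599 observations, so its numbers will not match the summary above, and the names `wine_head` and `range_and_mean` are my own.

```r
# First ten observations, copied from the str() output of this report.
# This is a toy sample, NOT the full data set.
wine_head <- data.frame(
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071),
  pH        = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One column per variable, with rows "min", "mean" and "max".
range_and_mean <- sapply(
  wine_head,
  function(x) c(min = min(x), mean = mean(x), max = max(x))
)

print(round(range_and_mean, 3))
```

On the full data frame the same `sapply()` call should reproduce the minimum, mean, and maximum values listed above.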
The quality of red wine is approximately normally distributed around 5 to 6 (median 6, mean 5.636), which suggests that the quality of this red wine collection is generally good.
Looking at all the chemical variables that may have an impact on the wine, I found that residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates are positively skewed, while density and pH are approximately normally distributed.
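Skew can also be checked numerically instead of only by eye. The sketch below is an illustration, not the method used for the plots in this report: `sample_skewness` is a helper I define here (the moment-based skewness coefficient), and the residual sugar values are just the first ten rows from the `str()` output.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
# Positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten residual.sugar values from the str() output (toy sample).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

skew_value <- sample_skewness(residual_sugar)
mean_above_median <- mean(residual_sugar) > median(residual_sugar)

print(skew_value)
print(mean_above_median)
```

A mean that sits above the median, together with a positive skewness value, is the numeric counterpart of the long right tail seen in the histograms.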
I grouped free and total sulfur dioxide together. From the histograms above I see that both free.sulfur.dioxide and total.sulfur.dioxide look approximately normal once they are plotted on a log10 scale.
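As a rough sketch of how the two sulfur dioxide variables can be binned on a log10 scale, the base R code below builds the histogram bins without drawing them. It is an assumption-laden illustration: the values are only the first ten rows from the `str()` output, base `hist()` stands in for whatever plotting code produced the charts, and the names `free_so2`, `total_so2`, and `log_bins` are mine.

```r
# First ten values of each variable from the str() output (toy sample).
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Bin the log10-transformed values; plot = FALSE returns the bins
# (breaks and counts) without opening a graphics device.
log_bins <- lapply(
  list(free = free_so2, total = total_so2),
  function(x) hist(log10(x), breaks = 5, plot = FALSE)
)

print(log_bins$free$counts)
print(log_bins$total$counts)
```

Calling `plot(log_bins$total)` would draw the corresponding histogram.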
I grouped the acids together. From the histograms above I see that most red wines have fixed acidity between 5 and 11, volatile acidity between 0.2 and 0.8, and citric acid between 0.01 and 0.50.
From the chart above it seems that the alcohol content is not normally distributed: it has a high peak at the lower end of the range and a long tail toward higher values.
From the histogram above I defined a new variable called alcohol density (alcohol.density), based on alcohol, to see which alcohol range holds the most wines:
- Low Alcohol: alcohol between 8.3 and 10.5
- Medium Alcohol: alcohol between 10.55 and 12.5
- High Alcohol: alcohol between 12.6 and 14.9

I found that Low has the highest count at around 1000 wines, followed by Medium at around 550 and High at around 60.
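One way to build such an ordered alcohol category is `cut()`. This is a hedged sketch, not necessarily the code behind alcohol.density: the break points 8.3, 10.5, 12.5, and 14.9 come from the ranges in the text, while the choice of left-closed intervals (`right = FALSE`) is my assumption, made because the `str()` output codes an alcohol value of 10.5 as the second level. The level names other than "Low Alcohol" are also taken from the prose rather than from the truncated `str()` output.

```r
# First ten alcohol values from the str() output (toy sample).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Assumed binning: [8.3, 10.5) low, [10.5, 12.5) medium, [12.5, 14.9] high.
alcohol_density <- cut(
  alcohol,
  breaks = c(8.3, 10.5, 12.5, 14.9),
  labels = c("Low Alcohol", "Medium Alcohol", "High Alcohol"),
  right = FALSE,          # intervals are closed on the left
  include.lowest = TRUE,  # so the top value 14.9 is not dropped
  ordered_result = TRUE   # gives an ordered factor, as in str()
)

print(table(alcohol_density))
```

With these settings the ten sample rows get the level codes 1 1 1 1 1 1 1 1 1 2, the same codes that `str()` reports for alcohol.density.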
From the histogram above I defined a new variable called alcohol quality (alcohol.quality), based on quality: low for quality below 5, Medium for quality 5 to 6, and v.good for quality 7 and above. I found that Medium is the most common category.
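The quality grouping can be sketched the same way with `cut()`. The cut points below are an assumption inferred from the `str()` output, where quality 5 and 6 are coded as the second level and quality 7 as the third; the label "v.good" comes from the prose, since `str()` truncates the level names.

```r
# First ten quality scores from the str() output (toy sample).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed binning: up to 4 is low, 5 to 6 is Medium, 7 and above is v.good.
alcohol_quality <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "Medium", "v.good"),
  ordered_result = TRUE
)

print(table(alcohol_quality))
```

For the ten sample rows this yields the level codes 2 2 2 2 2 2 2 3 3 2, in line with the alcohol.quality column in `str()`.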
## 'data.frame': 1599 obs. of 14 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ alcohol.density : Ord.factor w/ 3 levels "Low Alcohol"<..: 1 1 1 1 1 1 1 1 1 2 ...
## $ alcohol.quality : Ord.factor w/ 3 levels "low"<"Medium"<..: 2 2 2 2 2 2 2 3 3 2 ...
The wine data set contains 1599 observations and 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality).
I also created 2 variables, alcohol density and alcohol quality, which brings the total to 14.
Wine quality and alcohol are the main features of interest.
Sulphates and density are the other features I expect to help the investigation.
Yes, I created two variables: alcohol density and alcohol quality.
Yes, I observed some unusual distributions in the fixed acidity, citric acid, volatile acidity, free sulfur dioxide, and total sulfur dioxide variables, so I applied a log10 transformation to understand their distributions better.
From the chart above I see that density increases with fixed acidity.
From the chart above I see that density increases as alcohol decreases.
From the chart above I see that the relationship is negative: lower pH correlates with higher fixed acidity.
From the chart above I see that density increases with residual sugar.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
From the chart above I see that there is a negative correlation between citric acid and volatile acidity.
From the chart above I see that the wines with High alcohol have the lowest median density.
From the chart above I see that the wines with v.good quality have the lowest median volatile acidity.
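The medians that a box plot displays can also be computed per group. The sketch below is illustrative only: it uses the first ten rows from the `str()` output, and it assumes the quality grouping is low below 5, Medium for 5 to 6, and v.good for 7 and above. The names `quality_group` and `median_by_group` are mine.

```r
# First ten rows from the str() output (toy sample).
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed grouping of the quality scores into three ordered levels.
quality_group <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "Medium", "v.good"),
  ordered_result = TRUE
)

# Median volatile acidity within each quality group.
# Groups with no rows in this small sample come back as NA.
median_by_group <- tapply(volatile_acidity, quality_group, median)

print(median_by_group)
```

Even in this tiny sample the v.good group has a lower median volatile acidity than the Medium group, which is the same direction as in the box plot.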
I see a positive correlation between fixed acidity and density and between residual sugar and density, and a negative correlation between density and alcohol and between fixed acidity and pH.
From the box plots, I see a negative relationship between alcohol quality and volatile acidity and a positive relationship between alcohol quality and alcohol.
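The direction of these relationships can be checked with Pearson correlation coefficients. This is a minimal sketch under stated assumptions: it uses only the first ten rows from the `str()` output, so the coefficients are not the ones for the full data set, and density is left out because `str()` truncates its values. The names `wine_head` and `cor_matrix` are mine.

```r
# First ten rows of four variables, copied from the str() output (toy sample).
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine_head, method = "pearson")

print(round(cor_matrix, 2))
```

For a single pair, `cor.test(wine$fixed.acidity, wine$pH)` on the full data frame (here assumed to be called `wine`) would also report a confidence interval and p-value.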
There is a negative correlation between citric acid and volatile acidity, and a positive relationship between fixed.acidity and density.
The strongest relationship is between alcohol quality and alcohol.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the previous plots I wanted to explore how the chemical variables interact with alcohol. Wines with High alcohol density have low chlorides, volatile acidity, and density, while free sulfur dioxide increases with high alcohol density.
From the previous plots I wanted to explore the relationship between alcohol quality and alcohol. I found that high-quality wine has medium alcohol, lower density, high citric acid, and low volatile acidity.
From the previous chart I found that fixed acidity and density increase together while pH decreases.
I observed that:
- Wines with High alcohol density have low chlorides, volatile acidity, and density.
- Free sulfur dioxide increases with high alcohol density.
- High-quality wine has medium alcohol, lower density, high citric acid, and low volatile acidity.
High quality wines have lower volatile acidity.
The quality chart shows that about 80% of the red wines have good quality and that quality is approximately normally distributed around 5 to 6.
The distributions in the charts above are approximately normal; free.sulfur.dioxide peaks at around 10, although the summary gives its mean as 15.87.
The chart above shows a negative relationship between fixed acidity and pH.
At the beginning of the project I tried to get to know the data set better. I found 1,599 observations with 13 variables, and I removed the X column, which left 12 variables. During the analysis I noticed that quality is strongly related to several of the variables, so I defined a new variable, “alcohol quality”, to see this clearly in the charts.
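Dropping an index column such as X takes one line in base R. The sketch below uses a small hypothetical data frame, `wine_raw`, with made-up X values standing in for the row index, only to show the step; it is not the loading code used for this project.

```r
# Hypothetical stand-in for the raw data: X is assumed to be a row index.
wine_raw <- data.frame(
  X       = 1:3,
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

# Keep every column except X; drop = FALSE keeps the result a data frame
# even if only one column were left.
wine <- wine_raw[, setdiff(names(wine_raw), "X"), drop = FALSE]

print(names(wine))
```

Applied to the full data, the same step takes the data frame from 13 columns down to the 12 chemical and quality variables.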
In the future, with data on better-quality wines, the alcohol quality chart that I plotted would change, and v.good would become the largest category instead of Medium. As a next step I will develop a statistical model for the red wine data set.